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Abstract 

The objective of change-point detection is to discover abrupt property changes ly- 
ing behind time-series data. In this paper, we present a novel statistical change- 
point detection algorithm based on non-parametric divergence estimation between 
time-series samples from two retrospective segments. Our method uses the rela- 
tive Pearson divergence as a divergence measure, and it is accurately and efficiently 
estimated by a method of direct density-ratio estimation. Through experiments 
on artificial and real-world datasets including human-activity sensing, speech, and 
Twitter messages, we demonstrate the usefulness of the proposed method. 

Keywords: change-point detection, distribution comparison, relative density-ratio 
estimation, kernel methods, time-series data 
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1 Introduction 

Detecting abrupt changes in time-series data, called change-point detection, has 
attracted researchers in the statistics and data mining communities for decades 



( IBasseville and Nikiforovl . Il993l ; iGustafssonl . l2000t iBrodsky and Darkhovskyl . 119931 ) . De- 
pending on the delay of detection, change -point detection met h ods c a n be classified 
into t wo categ o ries: Real-time detection (jAdams and MacKavl. 120071: iGarnett et al. 



200 91: iPaauetl. 120071) a nd retrospective detection (IBasseville and Nikiforovl . [1993; 



Takeuchi and Yamanishil . 120061 ; iMoskvina and Zhigljavskyl . l2003af ) . 

Real-time change-point detection targets applications that require immediate re- 
sponses such as robot control. On the other hand, although retrospective change-point 
detection requires longer reaction periods, it tends to give more robust and accurate de- 
tection. Retrospective change-point detection accommodat es various applicat ions that 



allow certain delays, for example, clima te change detection (IReeves et al.l. 120071). genetic 



201ll ). signal segme ntation f|Basseville and Nikiforov 



time- series analysis ( jWang et al 

19931 ). and intrusion detection in computer networks ( JYamanishi et al 



20001 ) . In this 



paper, we focus on the retrospective change-point detection scenario and propose a novel 
non-parametric method. 

Having been studied for decades, some pioneer works demonstrated good change- 
point detection performance by compari ng the probability distribution s of time-series 
samples over past and present intervals ( IBasseville and Nikiforovl . Il993l ). As both the 
intervals move forward, a typical strategy is to issue an alarm for a change point when 
the two distributions are becoming significantly different. Vario us change-point detection 
meth ods follow this strategy, for example, the cumulative sum (JBasseville and Nikiforovl . 
1993 ), the generalized like l ihood -ratio method ( IGustafssonl . Il996l ). and the change finder 



(JTakeuchi and Yamanishil. 120061). Such a strategy has also be en employed in n ovelty 



detection (jGuralnik and Srivastaval . Il999l ) and outlier detection (IHido et all . l201ll ). 

Another group of met hods that have attracted h igh po pu larity in recent years 



is the subspace meth ods (IMoskvina and Zhigljavskyl . l2003al lbl: llde and Tsudal . 12007 : 
Kawahara et all 120071 ). By using a pre-designed time-series model, a subspace is discov- 
ered by principal component analysis from trajectories in past and present intervals, and 
their dissimilarity is measured by the distance between the subspaces. One of the major 
approaches is called subspace identification, which compares the subspaces spanned by the 
colum ns of an extended obser vability matrix generated by a state-space model with system 



noise (IKawahara et al.l . 120071 ). Recent efforts along this line of resea rch have led to a com - 



putationally efficient algorithm based on Krylov subspace learning (llde and Tsudal. 120071) 
and a successful application of detecting climate change in south Kenya (lltoh and Kurthsl . 
2010h . 

The methods explained above rely on pre-designed param e tric models, such 



as u nderlying probability dis tributions (JBasseville and Nikiforovl . Il993l : iGustafsson 



1996 ), auto- regressive mod e ls dTakeuchi and Yamanishil. I2006T). and state-space m odels 



(IMoskvina and Zhigljavskyl . l2003al lbl: llde and Tsudal . 120071 : IKawahara et all 120071 ). for 



tracking specific statistics such as the mean, the variance, and the spectrum. As alter- 
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Figure 1: Rationale of direct density-ratio estimation. 



natives, non-parametric methods such as kernel density estimation (ICsorgo and Horvathl . 
19881 ; iBrodsky and Darkhovskyl . Il993l ) are designed with no particular parametric assump- 
tion. However, they tend to be les s accurate in high- dimensiona l problems because of the 
so-called curse of dimensionality (jBellmanl . 11961c IVapnikl . 119981 ) . 

To overcome this difficulty, a new strategy was introduced recently, which esti- 
mate s the ratio of probabilit y densities directly without going through density estima- 
tion ( ISugiyama et al.l . |2012bJ). The rationale of this density-ratio estimation idea is that 
knowing the two densities implies knowing the density ratio, but not vice versa; knowing 
the ratio does not necessarily imply knowing the two densities because such decomposi- 
tion is not unique (Figure ED). Thus, direct density -ratio estimation is substantially easier 
than density estimation ( ISugiyama et al.l . |2012b|). Following this idea, m ethods of di- 



rect density-rati o estimation have be en developed (ISugiyama et al. 



mean matching (IGretton et al.l . 120091 ) . the logistic-regression method flBickel et al.l. 120071) 



2012al). e.g.. kern el 



and the Kullback-Leibler importance estimation procedure (KLIEP) (iSugiyama et al. 



20081 ). In the con text of change-point dete c tion, KLIEP was reported to outperform 
other approac hes (iKawahara and Sugivamal. 120121) such as the one-class support vec- 



tor machine (IScholkopf et al. 



2001 



Desobry et al.l . 120051 ) and singular-spectrum anal- 



ysis ( iMoskvina and Zhigljavskyl . |2003bJ). Thus, change-point detection based on direct 
density-ratio estimation is promising. 

The goal of this paper is to further advance this line of research. More specifi- 
cally, our contributions in this paper are two folds. The first contribution is to apply a 
recently-proposed density-r atio estimation method called the unconstrained least-squares 
importance fitting (uLSIF) (IKanamori et al.l . 120091 ) to change-point detection. The basic 
idea of uLSIF is to directly learn the density-ratio function in the least-squares fitting 
frame work. Notable adyantag es of uLSIF are that its solution can be computed analyt- 
i cally (JKanamori et al.1. J20091) . it achieves the optimal non-para metric convergence rat e 
(IKanamori et al.l . l2012bl ). it has the opti mal numerical stability (IKanamori et al.1 . I2013T ). 
and it has higher robustness than KLIEP (ISugiyama et al.l . l2012al ). Through experiments 
on a range of datasets, we demonstrate the superior detection accuracy of the uLSIF-based 
change-point detection method. 

The second contribution of this paper is to further improve the uLSIF-based change- 
point detection m ethod by employing a state-of-the-art extension of uLSIF called relative 
uLSIF (RuLSIF) (jYamada et al.1 . 120131 ). A potential weakness of the density-ratio based 
approach is that density ratios can be unbounded (i.e., they can be infinity) if the denom- 
inator density is not well-defined. The basic idea of RuLSIF is to consider relative density 
ratios, which are smoother and always bounded from above. Theoretically, it was proved 
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-> y(t): set of n retrospective subsequences: 



, Time y(t):={Y(t),Y(t + l),...,Y(t + n-l)} 

y(t)-> <—y(t + n) 

Figure 2: An illustrative example of notations on one-dimensional time-series data. 
that RuLSIF possesse s a superior non-parametric convergence property than plain uLSIF 



(jYamada et all 120131 ) . implying that RuLSIF gives an even better estimate from a small 
number of samples. We experimentally demonstrate that our RuLSIF-based change-point 
detection method compares favorably with other approaches. 

The rest of this paper is structured as follows: In Section [2J we formulate our change- 
point detection problem. In Section [31 we describe our proposed change-point detection 
algorithms based on uLSIF and RuLSIF, together with the review of the KLIEP-based 
method. In Section HI we report experimental results on various artificial and real-world 
datasets including human-activity sensing, speech, and Twitter messages from February 
2010 to October 2010. Finally, in Section [5j conclusions together with future perspectives 
are stated. 



2 Problem Formulation 

In this section, we formulate our change-point detection problem. 
Let y(t) G IR d be a d- dimensional time-series sample at time t. Let 

Y(t) := [y(t) T , y(t + 1) T , ...,y(t + k- 1) T ] T G R dk 

be a "subsequence"!!! of time series at time t with length fc, where T represents the trans 



pose. Following the previous work (IKawahara and Sugiyamal . |2012j), we treat the subse 



quence Y(t) as a sample, instead of a single d- dimensional time-series sample y(t), by 
which time-dependent information can be incorporated naturally (see Figure [2]). Let y(t) 
be a set of n retrospective subsequence samples starting at time t: 

y(t) := [Y(t), Y(t + l),...,Y(t + n-l)}. 

Note that [Y(t), Y(t + 1), . . . , Y(t + n — 1)] G R dkxn forms a Hankel matrix and plays a 



key ro l e in change-point detec tion based on subspace learning flMoskvina and Zhigljavsky 



2003al : IKawahara et all 120071 ) 



^n fact, only in the case of one-dimensional time-series, Y(t) is a subsequence. For higher-dimensional 
time-series, Y(i) concatenates the subsequences of all dimensions into a one-dimensional vector. 
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For change-point detection, let us consider two consecutive segments y(t) and y(t+n). 
Our strategy is to compute a certain dissimilarity measure between y(t) and y(t + n), and 
use it as the plausibility of change points. More specifically, the higher the dissimilarity 
measure is, the more likely the point is a change pointo 

Now the problems that need to be addressed are what kind of dissimilarity measure 
we should use and how we estimate it from data. We will discuss these issues in the next 
section. 

3 Change- Point Detection via Density- Ratio Estima- 
tion 

In this section, we first define our dissimilarity measure, and then show methods for 
estimating the dissimilarity measure. 



3.1 Divergence-Based Dissimilarity Measure and Density-Ratio 
Estimation 



In this paper, we use a dissimilarity measure of the following form: 



D{Pt\\Pt 



t+n) 



D{P t+ n\\Pt, 



i) 



where Pt and Pt+ n are probability distribu tions of samples in y(t) and y(t + ri), respec- 
tively. D(P\\P') denotes the f -divergence (jAli and Silveyi . Il966l ; ICsiszarl . Il967l ): 



D{P\\P') := / p'(Y)f 



P(Y) 
p'(Y) 



dY, 



(2) 



where / is a convex function such that /(l) = 0, and p(Y) and p'(Y) are probability 
density functions of P and P', respectively. We assume that p(Y) and p'{Y) are strictly 
positive. Since the /-divergence is asymmetric (i.e., D(P\\P') ^ D(P'\\P)), we symmetrize 
it in our dissimilarity measure (CQ) for all divergence-based methodgj 

The /-divergence includes vario us popular divergences such as the Kullback-Leibler 
(KL) divergence by f(t) = tlogt (jKullback and Leiblerl . Il95ll ) and the Pearson (PE) 



2 Another possible formulatio n is to c ompa re distributions of samples in y(t) and y(t + n) in the 
framework of hypothesis testing ( Henkel 119761 ). Although this gives a useful threshold to determine 
whether a point is a change point, computing t he p- value is often time consuming, particularly in a non- 
parametric setup ( Efron and Tibshiranl 119931 ). For this reason, we do not take the hypothesis testing 
approach in this paper, although it is methodologically straightforward to extend the proposed approach 
to the hypothesis testing fra mework. 

3 In the previous work ( Kawahara and Sugivamal 120121 ). the asymmetric dissimilarity measure 
D{P t \\P t+n ) was used. As we numerically illustrate in Section 0] the use of the symmetrized divergence 
contributes highly to improving the performance. For this reason, we decided to use the symmetrized 
dissimilarity measure §Q. 
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divergence by /(£) = |(t — l) 2 fjPearsonl . Il90dl ): 

KL(P\\P>):= J p(Y) log (^^dY 

■p(Y) 



vHp\\pr.= - 2 ]v\Y) Kpi{Y) 



AY. 



(3) 
(4) 



Since the probability densities p(Y) and p'(Y) are unknown in practice, we cannot 
directly compute the /-divergence (and thus the dissimilarity measure). A naive way to 
cope with this problem is to perform density estimation and plug the estimated densities 
p(Y) and j?(Y) in the definit ion of the /-d ivergence. However, density estimation is 
known to be a hard problem (jVapnikl . Il998l ). and thus such a plug-in approach is not 
reliable in practice. 

Recently, a novel meth od of divergence approximation base d on d i rect density-ratio 
estim ation was explored (jSugiyama et all 120081 ; iNguyen et all l2010l ; iKanamori et all 
20091 ). The basic idea of direct density-ratio estimation is to learn the density-ratio 



function 3ry) without going through separate density estimation of p(Y) and p'(Y). An 
intuitive rationale of direct density-ratio estimation is that knowing the two densities p(Y) 
and p'(Y) means knowing their ratio, but not vice versa; knowing the ratio ,/ Y \ does not 
necessarily mean knowing the two densities p(Y) and p'(Y) because such decomposition 
is not unique (see Figured]). This implies that estimating the density ratio is substantially 
easier than estimatin g the densities, and thu s directly estimating the density ratio would 
be more promising^ ( Sugiyama et all |2012bJ). 

In the rest of this section, we review three methods of directly estimating the density 



ratio 



p(X) 

p'(Y) 



from samples {"Ki}™ =1 and {Y'^ =l drawn from p(Y) and p'{Y)\ The KL 



r ^ 1 , h — 1 I h. r 

importance estimation procedure (KLIEP) (ISugivama et all 120081) i n Sec tion I3.2i uncon 
strained least-squares importa nce fitting (uLSIF) (jKanamori et al.l . 120091 ) in Section 
and relative uLSIF (RuLSIF) (jYamada et all . 120131 ) in Section 13.41 



3.2 KLIEP 



KLIEP (JSugiyama et all 120081 ) is a direct density-ratio estimation algorithm that is suit- 
able for estimating the KL divergence. 



4 Vladimir Vapnik advocated in his seminal book ( Vapnikl . ll998h that one should avoid solving a more 
difficult problem as an intermediate step. The support vector machine (jCortes and Vapnikl . 119951) is a 
representative example that demonstrates the usefulness of this principle: It avoids solving a more general 
problem of estimating data generating probability distributions, and only learns a decision boundary 
that is sufficient for pattern recognition. The idea of direct density-ratio estimation also follows Vapnik's 
principle. 
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Let us model the density ratio -^Wtt by the following kernel model 



3.2.1 Density-Ratio Model 

Pi 

p'{Y) 

it 

g(Y;6):=J2^K(Y,Y e ), (5) 



l=x 



where 6 := (Oi, . . . , 6 n ) T are parameters to be learned from data samples, and K(Y, Y') 
is a kernel basis function. In practice, we use the Gaussian kernel: 

/ \\Y-Y'\\ 2 
K(Y, Y') = exp f- " ^ " 

where a (> 0) is the kernel width. In all our experiments, the kernel width a is determined 
based on cross-validation. 

3.2.2 Learning Algorithm 

The parameters 6 in the model g(Y; 6) are determined so that the KL divergence from 
p(Y) to g(Y; 0)p'(Y) is minimized: 

= Jp(Y)\og(^) AY - J p(Y)\og(g(Y;8)) AY 

After ignoring the first term which is irrelevant to g(Y; 6) and approximating the second 
term with the empirical estimates, the KLIEP optimization problem is given as follows: 

n / n \ 

max -J] log \Y,e t K{Y u Y t ) , 

n i=i \e=i J 

1 n n 

s.t. -J2Y1 e R ( Y 'p Y i) = 1 and & i> ■ ■ ■ » °n > 0. 

j=l 1=1 

The equality constraint is for the normalization purpose because g(Y; 6)p'(Y) should be 
a probability density function. The inequality constraint comes from the non-negativity 
of the density-ratio function. Since this is a convex optimization problem, the unique 
global optimal solution 6 can be simply obtained, for example, by a gradient-projection 
iteration. Finally, a density-ratio estimator is given as 

n 

g(Y) = J20iK(Y,Y e ). 
l=i 

KLIEP was shown to achieve the optimal non-parametric convergence rate 



( ISugivama et all l2008t iNguven et all 120101 ). 
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3.2.3 Change-Point Detection by KLIEP 

Given a density-ratio estimator ~g(Y), an approximator of the KL divergence is given as 



-— 1 n 

KL:=-Vlog?(T^). 
n t-^ 



In the previous work (IKawahara and Sugiyamal . 120121 ). this KLIEP-based KL- 
divergence estimator was applied to change-point detection and demonstrated to be 
promising in experiments. 



3.3 uLSIF 

Recently, anoth er di rect density-ratio estimator called uLSIF was proposed 

( Kanamori et all 120091 . l2012bl ). which is suitable for estimating the PE divergence. 



3.3.1 Learning Algorithm 



In uLSIF, the same density-ratio model as KLIEP is used (see Section I3.2.ip . However, 
its training criterion is different; the density-ratio model is fitted to the true density- 
ratio under the squared loss. More specifically, the parameter 6 in the model g(Y; 0) is 
determined so that the following squared loss J(Y) is minimized: 



J{Y) 



P(Y) 
P'(Y] 

P(Y) 
P'(Y) 



g(Y;6)j p'(Y)dY 

P '{Y) dY - J P (Y)g(Y- 6) dY + l - J g(Y; 0) 2 p'(Y) dY. 



Since the first term is a constant, we focus on the last two terms. By substituting g(Y; 6) 
with our model stated in (|3J) and approximating the integrals by the empirical averages, 
the uLSIF optimization problem is given as follows: 



mm 



~O T H0-h r 0+-0 T O 



(6) 



where the penalty term ^0 6 is included for a regularization pu rpose. A (> 0) denote s 



the regularization parameter, which is chosen by cross-validation (jSugiyama et all 120081 ) . 
H is the n x n matrix with the (£, £')-th element given by 



1 n 



(7) 



i=i 
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h is the n- dimensional vector with the £-th element given by 

1 n 
h t :=-J2K(Y i ,Y l ). 

j=i 
It is easy to confirm that the solution 6 of ([6]) can be analytically obtained as 

e = (H + xi n y 1 h, (8) 

where J n denotes the n-dimensional identity matrix. Finally, a density-ratio estimator is 
given as 

n 

g(Y) = J2^K(y^i)- 

3.3.2 Change-Point Detection by uLSIF 

Given a density-ratio estimator ~g(Y), an approximator of the PE divergence can be 
constructed as 



7 = 1 4=1 



This approxima t or is derive d from the following expression of the PE divergence 
JSugiyama et~aD . l2010l IJOllbh : 



PE(P||P') 



P(Y) 
p'(Y) 



' P ' {Y)dY + l{^ P{Y)dY -\- 



(9) 



The first two terms of (Q are actually the negative uLSIF optimization objective 
without regular izat ion. This expression can also be obtained based on the fact that 
the /-divergence D(P\\P') is lower-bounded via the Legendre -Fenchel convex duality 
(iRockafellarl . Il970h as follows JKeziouL 120031 : iNguven et ail l2007h : 



D(P\\P>) 



sup 

h 



p{Y)h(Y) AY - p'(Y)f*(h(Y)) dY 



(10) 



where /* is the convex conjugate of convex function / defined at (J2J). The PE divergence 
corresponds to f(t) — hit — l) 2 , for which convex conjugate is given by f*(t*) 



(t*y 



t*. 



For /(£) = \{t — l) 2 , the supremum can be achieved when ^pY — h{Y) + 1. Substituting 
h(Y) = 4^t ~ l int0 TO, we can obtain ©. 



p'(Y) 



V'(Y) 



uL SIF has some n otable advantages: Its solution can be computed analyti- 



gence rate f|Kanamori et al. 



cally (IKanamori et al.l. 12009) and it possesses the optimal non-parametric conver- 



2012bl ). Moreover, it has the optimal numerical stab ility 



( IKanamori et all 120131 ) , and it is more robust than KLIEP (JSugiyama et all l2012al ) . In 
Section HI we will experimentally demonstrate that uLSIF-based change-point detection 
compares favorably with the KLIEP-based method. 
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3.4 RuLSIF 

Depending on the condition of the denominator density p'(Y), the density-ratio value 

,/ Y \ can be unbounded (i.e., they can be infinity). This is actually problematic because 

the non-parametric convergence rate of uLSIF is governed by the "sup" -norm of the 

true density-ratio function: maxy , lv \ ■ To overcome this problem, relative density-ratio 



estimation was introduced (lYamada et all 1201 3l ). 



_p!J^X 



3.4.1 Relative PE Divergence 

Let us consider the a-relative PE- divergence for < a < 1: 

PE a (P\\P') := PE(P\\aP + (1 - a)P') 



/^($H V 



where p' a {Y) = ap(Y) + (1 — a)p'(Y) is the a-mixture density. We refer to 

p(Y) 



ra ^ ap(Y) + (1 - a)p' (Y) 



as the a-relative density-ratio. The a-relative density-ratio is reduced to the plain density- 
ratio if a = 0, and it tends to be "smoother" as a gets larger. Indeed, one can confirm 
that the a-relative density-ratio is bounded above by 1/a for a > 0, even when the 
plain density-ratio , iv \ is unbound ed. This was proved to contribute to improving the 



estimation accuracy (lYamada et all 120131 ) . 



As explained in Section 13. 1[ we use symmetrized divergence 

PE a (P||P / )+PE a (P / ||P) 
as a change-point score, where each term is estimated separately. 

3.4.2 Learning Algorithm 

For approximating the a-relative density ratio r a (Y), we still use the same kernel model 
g(Y; 6) given by (|5]). In the same way as the uLSIF method, the parameter is learned 
by minimizing the squared loss between true and estimated relative ratios: 

J(Y) = \jjJX) (r a (Y) - g(Y; d)f dY 

= 1 j p' a (Y) r 2 a (Y) dY - J p(Y)r a (Y)g(Y; 6) dY 

+ | Jp(Y)g(Y; Of dY + l —^j p \Y)g{Y- Of dY. 
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Again, by ignoring the constant and approximating the expectations by sample aver- 
ages, the a-relative density-ratio can be learned in the same way as the plain density-ratio. 
Indeed, the optimization problem of a relative variant of uLSIF, called RuLSIF, is given 
as the same form as uLSIF; the only difference is the definition of the matrix H: 

Thus, the advantages of uLSIF regarding the analytic solution, numerical stability, and 
robustness are still maintained in RuLSIF. Further more, RuLSIF possess es an even better 



non-parametric convergence property than uLSIF (jYamada et al.l . 120131 1 . 



3.4.3 Change-Point Detection by RuLSIF 

By using an estimator ~g(Y) of the a-relative density-ratio, the a-relative PE divergence 
can be approximated as 



t?(y-) 2 ' 1 -^±mf+ l „t^)--r 



n -, n 

2n-^— ' JV "' 2n ^-^ Jy 3 ' n 

i=l j=l i=l 



In Section HI we will experimentally demonstrate that the RuLSIF-based change-point 
detection performs even better than the plain uLSIF-based method. 

4 Experiments 

In this section, we experimentally investigate the performance of the proposed and existing 
change-point detection methods on artificial and real-world datasets including human- 
activity sensing, speech, and Twitter messages. The MATLAB implementation of the 
proposed method is available at 



' |http : //sugiyama-www . cs . titech . ac . jp/~ s ong/ change _detect ion/ ' . 



For all experiments, we fix the parameters at n = 50 and k — 10. a in the RuLSIF- 
based method is fixed to 0.1. Sensitivity to different parameter choices and more issues 
regarding algorithm-specific parameter tuning will be discussed below. 

4.1 Artificial Datasets 

As mentioned in Section 13. H we use the symmetrized divergence for change-point detec- 
tion. We first illustrate how symmetrization of the PE divergence affects the change-point 
detection performance. 

The top graph in Figure [3] shows an artificial time-series signal that consists of three 
segments with equal length of 200. The samples are drawn from A/"(0, 2 2 ), A/"(0, l 2 ), and 
A/"(0,2 2 ), respectively, where A/"(/z, a 2 ) denotes the normal distribution with mean /x and 
variance a 2 . Thus, the variances change at time 200 and 400. In this experiment, we 
consider three types of divergence measures: 
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300 



400 
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Figure 3: (Top) The original signal (blue) is segmented into 3 sections with equal length. 
Samples are drawn from the normal distributions A/"(0, 2 2 ), A/"(0, l 2 ), and A/"(0, 2 2 ), respec- 
tively. (Bottom) Symmetric (red) and asymmetric (black and green) PE Q divergences. 



• PE Q (Symmetric) : PE a (P t ||P t+n ) + PE a (P i+n ||P t ), 

• PE Q (Forward):PE a (P t ||P t+n ), 

• PE Q (Backward) : PE a (P i+n ||P t ). 

Three divergences are compared in the bottom graph of Figure [3j 

As we can see from the graphs, PE Q (Forward) detects the first change point success- 
fully, but not the second one. On the other hand, PE a (Backward) behaves oppositely. 
This implies that combining forward and backward divergences can improve the overall 
change-point detection performance. For this reason, we only use PE Q (Symmetric) as the 
change-point score of the proposed method from here on. 

Next, we illustrate the behavior of our proposed RuLSIF-based method, and then 
compare its performance with the uLSIF-based and KLIEP-based methods. In our im- 
plementation, two sets of candidate parameters, 

• a — 0.6d mcd , 0.8d mcd , d mcd , 1.2rJ med , and 1.4d med , 

• A = 10" 3 , 10~ 2 , 10" 1 , 10°, and 10\ 
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are provided to the cross-validation procedure, where <i mc d denotes the median distance 
between samples. The best combination of these parameters is chosen by grid search via 
cross-validation. We use 5-fold cross-validation for all experiments. 

We use the following 4 artificial time-series datasets that contain manually inserted 
change-points: 

• Dataset 1 (Ju mping mean): The following; 1 - dimensional auto-regressive model 
borrowed from iTakeuchi and Yamanishil (120061 ) is used to generate 5000 samples 
(i.e., t = 1,...,5000): 

y(i) = 0.6j/(t-l)-0.5j/(t-2) + e tl 

where e t is a Gaussian noise with mean \x and standard deviation 1.5. The initial 
values are set as y(l) — y(2) — 0. A change point is inserted at every 100 time steps 
by setting the noise mean a, at time t as 



UN 
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N 

N 



JlN-l + 16 

where N is a natural number such that 100(iV 



1. 

2. 



.,49, 
\<t< 1007V. 



Dataset 2 (Scaling variance): The same auto-regressive model as Dataset 1 is 
used, but a change point is inserted at every 100 time steps by setting the noise 
standard deviation a at time t as 



a 



N 



He + f 



N 
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2,4 48. 



Dataset 3 (Switching covariance): 2-dimensional samples of size 5000 are drawn 
from the origin-centered normal distribution, and a change point is inserted at every 
100 time steps by setting the covariance matrix S at time t as 
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Dataset 4 (Changing frequency): 1-dimensional samples of size 5000 are gen- 
erated as 

y(t) = sin(u;x) + e t , 

where et is a origin-centered Gaussian noise with standard deviation 0.8. A change 
point is inserted at every 100 points by changing the frequency u at time t as 



u N 
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Figure 4: Illustrative time-series samples (upper) and 
the change-point score obtained by the RuLSIF-based 
method (lower). The true change-points are marked 
by black vertical lines in the upper graphs. 
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Figure 5: Average ROC curves of 
RuLSIF-based, uLSIF-based, and 
KLIEP-based methods. 
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Table 1: The AUC values of RuLSIF-based, uLSIF-based, and KLIEP-based methods. 
The best and comparable methods by the t-test with significance level 5% are described 
in boldface. 

RuLSIF uLSIF KLIEP 



Dataset 1 
Dataset 2 
Dataset 3 
Dataset 4 



.848(.023) .763(.023) .713(.036) 

.846(.031) .806(.035) .623(.040) 

.972(.012) .943(.015) .904(.017) 

.844(.031) .801(.024) .602(.036) 



Note that, to explore the ability of detecting change points with different significance, 
we purposely made latter change-points more significant than earlier ones in the above 
dataset s. 

Figure H] shows examples of these datasets for the last 10 change points and corre- 
sponding change-point score obtained by the proposed RuLSIF-based method. Although 
the last 10 change points are the most significant, we can see from the graphs that, for 
Dataset 3 and Dataset 4, these change points can be even hardly identified by human. 
Nevertheless, the change-point score obtained by the proposed RuLSIF-based method 
increases rapidly after changes occur. 

Next, we compare the performance of RuLSIF-based, uLSIF-based, and KLIEP-based 
methods in terms of the receiver operating characteristic (ROC) curves and the area under 
the ROC curve (A UC) values. We define th e true positive rate and false positive rate in 
the following way ( iKawahara and Sugiyamal . 120121 ) : 



• True positive rate (TPR): n cr /n cp , 

• False positive rate (FPR): (n al — n cr )/n a i, 

where n cr denotes the number of times change points are correctly detected, n cp denotes 
the number of all change points, and n a i is the number of all dete ction alarms- 
Following t h e st rategy of the previous researches (jDesobry et all 12005 ; 



Harchaoui et al.l . 120091 ). peaks of a change-point score are regarded as detection 



alarms. More specifically, a detection alarm at step t is regarded as correct if there exists 
a true alarm at step t* such that t £ [t* — 10, t* + 10]. To avoid duplication, we remove 
the fcth alarm at step tj~ if tk — tk-i < 20. 

We set up a threshold n for filtering out all alarms whose change-point scores are lower 
than or equal to n. Initially, we set r\ to be equal to the score of the highest peak. Then, 
by lowering rj gradually, both TPR and FPR become non-decreasing. For each n, we plot 
TPR and FPR on the graph, and thus a monotone curve can be drawn. 

Figure O illustrates ROC curves averaged over 50 runs with different random seeds for 
each dataset. Table [1] describes the mean and standard deviation of the AUC values over 
50 runs. The best and comparable methods by the t-test with significance level 5% are 
described in boldface. The experimental results show that the uLSIF-based method tends 
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to outperform the KLIEP-based method, and the RuLSIF-based method even performs 
better than the uLSIF-based method. 

Finally, we investigate the sensitivity of the performance on different choices of n and 
k in terms of AUC values. In Figure El the AUC values of RuLSIF (a = 0.1 and 0.2), 
uLSIF (which corresponds to RuLSIF with a = 0), and KLIEP were plotted for k = 5, 
10, and 15 under a specific choice of n in each graph. We generate such graphs for all 4 
datasets with n = 25, 50, and 75. The result shows that the proposed method consistently 
performs better than the other methods, and the order of the methods according to the 
performance is kept unchanged over various choices of n and k. Moreover, the RuLSIF 
methods with a = 0.1 and 0.2 perform rather similarly. For this reason, we keep using the 
medium parameter values among the candidates in the following experiments: n = 50, 
k = 10, and a = 0.1. 



4.2 Real- World Datasets 

Next, we evaluate the performance of the density-ratio estimation based methods and 
other existing change-point detection methods using two real-world datasets: Human- 
activity sensing and speech. 

We include the following methods in our comparison. 



Singular spectrum transformation (SST) (U Vloskv ina and ZhigljavskyL 
2003al : llde and Tsudal . 120071 : lltoh and Kurthsl . l2010h : Change-point 



scores 

are evaluated on two consecutive trajectory matrices using the distance-based sin- 
gular spectrum analysis. This corresponds to a state-space model with no system 
noise. For this method, we use the first 4 eigenvectors to compare the difference 
between two subspaces, which was confirmed to be reasonable choice in our prelim- 
inary experiments. 



Subspace identification (SI) (JKawahara et al.l . 120071 ): SI identifies a sub- 
space in which time-series data is constrained, and evaluates the distance of target 
sequences from the subspace. The subspace spanned by the columns of an observ- 
ability matrix is used for estimating the distance from the subspace spanned by 
subsequences of time-series data. For this method, we use the top 4 significant 
singular values according to our preliminary experiment results. 



Auto regressive (AR) ( jTakeuchi and Yamanishil . 120061 ): AR first fits an AR 
model to time-series data, and then auxiliary time-series is generated from the AR 
model. With an extra AR model-fitting, the change-point score is given by the log- 
likelihood . The order of the AR model is chosen by Schwarz's Bayesian information 
criterion (jSchwarzl . Il978l ) . 



One-class support vector machine (OSVM) ( IDesobry et al.L 120051 ): 



Change-point scores are calculated by OSVM using two sets of descriptors of sig- 
nals. The kernel width a is set to the median value of the distances between samples, 
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Figure 6: AUC plots for n = 25, 50, 75 and k = 5, 10, 15. 
while the vertical axes denote AUC values. 



The horizontal axes denote k, 
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which is a popular heuristic in kernel methods (jScholkopf and Smolal . l2002l ). An- 
other parameter v is set to 0.2, which indicates the proportion of outliers. 

First, we use a human activity dataset. This is a subset of the Human Activity Sens- 
ing Consortium (HASC) challenge 2011a, which provides human activity information 
collected by portable three-axis accelerometers. The task of change-point detection is to 
segment the time-series data according to the 6 behaviors: "stay" , "walk" , "jog" , "skip" , 
"stair up" , and " stair down" . The starting time of each behavior is arbitrarily decided 
by each user. Because the orientation of accelerometers is not necessarily fixed, we take 
the ^2-norm of the 3-dimensional (i.e., x-, y-, and z-axes) data. 

In Figure 7(a) , examples of original time-series, true change points, and change-point 
scores obtained by the RuLSIF-based method are plotted. This shows that the change- 
point score clearly captures trends of changing behaviors, except the changes around time 
1200 and 1500. However, because these changes are difficult to be recognized even by 



human, we do not regard them as critical flaws. Figure 7(b) illustrates ROC curves aver- 



aged over 10 datasets, and Figure 7(c) describes AUC values for each of the 10 datasets. 



The experimental results show that the proposed RuLSIF-based method tends to perform 
better than other methods. 

Next, we use the IPS J SIG-SLP Corpora and Environments for Noisy Speech Recog- 
nition (CENSREC) dataset provided by National Institute of Informatics (Nlljj, which 
records human voice in a noisy environment. The task is to extract speech sections from 
recorded signals. This dataset offers several voice recordings with different background 
noises (e.g., noise of highway and restaurant). Segmentation of the beginning and ending 
of human voice is manually annotated. Note that we only use the annotations as the 
ground truth for the final performance evaluation, not for change-point detection (i.e., 
this experiment is still completely unsupervised). 

illustrates an example of the original signals, true change-points, and 



Figure 8 (a 



change-point scores obtained by the proposed RuLSIF-based method. This shows that 



the proposed method still gives clear indications for speech segments. Figure 8(b) and 
Figure 8(c) show average ROC curves over 10 datasets and AUC values for each of the 



10 datasets. The results show that the proposed method significantly outperforms other 
methods. 



4.3 Twitter Dataset 

Finally, we apply the proposed change-point detection method to the CMU Twitter 
datasevtH, which is an archive of Twitter messages collected from February 2010 to October 
2010 via the Twitter application programming interface. 

Here we track the degree of popularity of a given topic by monitoring the frequency of 
selected keywords. More specifically, we focus on events related to u Deepwater Horizon oil 





http : //hasc . jp/hc2011/| 
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Change-Point Detection in Time-Series Data by Relative Density-Ratio Estimation 19 



Original signal 




ill II 



ailflilMMIBIilii 




1500 



2000 




1500 



2000 



Time 



(a) One of the original signals and change-point scores obtained by the RuLSIF-based 

method 




0.2 0.4 0.6 0.8 

False Positive Rate 

(b) Average ROC curves 



ID 


RuLSIF 


uLSIF 


KLIEP 


AR 


SI 


SST 


OSVM 


1001 


.974 


.853 


.838 


.899 


.958 


.903 


.900 


1002 


.996 


.963 




909 


.872 


.969 


.880 


.905 


1003 


.989 


.854 




929 


.869 


.895 


.851 


.937 


1004 


.996 


.868 




890 


.881 


.941 


.886 


.891 


1005 


.938 


.952 




972 


.849 


.972 


.915 


.943 


1006 


.933 


.918 




889 


.778 


.890 


.925 


.842 


1007 


.972 


.857 




834 


.850 


.941 


.817 


.891 


1008 


.995 


.922 




930 


.892 


.981 


.860 


.907 


1009 


.987 


.880 




907 


.833 


.979 


.842 


.951 


1010 


.991 


.952 




889 


.821 


.915 


.867 


.903 


Ave. 


.977 


.902 




900 


.854 


.944 


.875 


.907 


Std. 


.024 


.044 




042 


.037 


.034 


.034 


.032 



(c) AUC values. The best and comparable methods by the t-tcst with 
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Figure 7: HASC human-activity dataset. 
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(c) AUC values. The best and comparable methods by the t-test with 
significance level 5% are described in boldface. 

Figure 8: CENSREC speech dataset. 
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Figure 9: Twitter dataset. 



spill in the Gulf of Mexico" which occurred on April 20, 201Qj, and was widely broadcast 
among the Twitter community. We use the frequencies of 10 keywords: "gulf" , "spill" , 
u bp" , u oil", "hayward" , " mexico" , "coast", "transocean" , "halliburton" , and "obama" 



(see Figure 9(a)). We perform change-point detection directly on the 10- dimensional 
data, with the hope that we can capture correlation changes between multiple keywords, 
in addition to changes in the frequency of each keyword. 

For quantitative evaluation, we referred to the Wikipedia entry "Timeline of the Deep- 
water Horizon oil spill"u as a real-world event source. The change-point score obtained 



by the proposed RuLSIF-based method is plotted in Figure 9(b), where four occurrences 



of important real-world events show the development of this news story. 



As we can see from Figure 9(b) , the change-point score increases immediately after the 



initial explosion of the deepwater horizon oil platform and soon reaches the first peak when 
oil was found on the sea shore of Louisiana on April 30. Shortly after BP announced its 
preliminary estimation on the amount of leaking oil, the change-point score rises quickly 
again and reaches its second peak at the end of May, at which time President Obama 



http : //en . wikipedia . org/wiki/Deepwater_Horizon_oil_spill 
'http: //en. wikipedia. org/wiki/Timeline_of _the_Deepwater_Horizon_oil_spill 
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visited Louisiana to assure local residents of the federal government's support. On June 
25, the BP stock was at its one year's lowest price, while the change-point score spikes at 
the third time. Finally, BP cut off the spill on July 15, as the score reaches its last peak. 



5 Conclusion and Future Perspectives 

In this paper, we first formulated the problem of retrospective change-point detection as 
the problem of comparing two probability distributions over two consecutive time seg- 
ments. We then provided a comprehensive review of state-of-the-art density-ratio and 
divergence estimation methods, which are key building blocks of our change-point detec- 
tion methods. Our contributions in this paper were to extend th e existing KLIEP-based 
change-point detection method (IKawahara and Sugiyamal . |2012j), and to propose to use 
uLSIF as a building block. uLSIF has various theoretical and practical advantages, for 
example, the uLSIF solution can be computed analytically, it possesses the optimal non- 
parametric convergence rate, it has the optimal numerical stability, and it has higher ro- 
bustness than KLIEP. We further proposed to use RuLSIF, a novel divergence estimation 
paradigm emerged in the machine learning community recently. RuLSIF inherits good 
properties of uLSIF, and moreover it possesses an even better non-parametric convergence 
property. Through extensive experiments on artificial datasets and real-world datasets 
including human- activity sensing, speech, and Twitter messages, we demonstrated that 
the proposed RuLSIF-based change-point detection method is promising. 

Though we estimated a density ratio b e tween two consecutive s egments, some earlier 
researches (IBasseville and Nikiforovl . Il993l ; iGustafssonl . Il996l . |2000| ) introduced a hyper- 
parameter that controls the size of a margin between two segments. In our preliminary 
experiments, however, we did not observe significant improvement by changing the margin. 
For this reason, we decided to use a straightforward model that two segments have no 
margin in between. 

Through the experiment illustrated in Figure [6] in Section 14.11 we can see that the 
performance of the proposed method is affected by the choice of hyper-parameters n and 
k. However, discovering optimal values for these parameters remains a challenge, which 
will be investigated in our future work. 

RuLSIF w a s sho wn to possess a better convergence property than uLSIF 
(jYamada et all 120131 ) in terms of density ratio estimation. However, how this theo- 



retical advantage in density ratio estimation can be translated into practical performance 
improvement in change detection is still not clear, beyond the intuition that a better di- 
vergence estimator gives a better change score. We will address this issue more formally 
in the future work. 

Although the proposed RuLSIF-based change-point detection was shown to work well 
even for multi-dimensional time-series data, its accuracy may be further improved by 
incorporating dimensionality reduction. Recently, several attemp ts were made to com- 
bine d i mensionality reduction with di rect density-ratio estimation (jSugiyama et all |2010| . 
2011bl ; lYamada and Sugiyamal . 1201 ll ). Our future work will apply these techniques to 
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change-point detection and evaluate their practical usefulness. 

Compared with other approaches, methods based on density ratio estimation tend to 
be computationally more expensive because of the cross-validation procedure for model 
selection. However, thanks to the analytic solution, the RuLSIF- and uLSIF-based meth- 
ods are computationally more efficient than the KLIEP-based m e thod that requires an 
iterative optimization procedure (see Figure 9 in iKanamori et al.l (120091 ) for the detailed 
time comparison between uLSIF and KLIEP). Our important future work is to further 
improve the computational efficiency of the RuLSIF-based method. 

In this paper, we focused on computing the change-point score that represents the 
plausibility of change points. Another possible formulation is hypothesis testing, which 
provides a useful threshold to determine whether a point is a change point. Method- 
ologically, it is straightforward t o extend the proposed method to produ c e the p -values, 



following the recent literatures (jSugiyama et al.l . l2011al ; IKanamori et al.l . l2012al ). How- 
ever, computing the p-value is often time consuming, particularly in a non-parametric 
setup. Thus, overcoming the computational bottleneck is an important future work for 
making this approach more practical. 

Recent repo r ts pointed out that Twitte r messages can be indicative of real- world events 
( Petrovic et al.l . |2010| ; ISakaki et al.l . 120101 ) . Following this line, we showed in Section 14.31 
that our change-detection method can be used as a novel tool for analyzing Twitter 
messages. An important future challenge along this line includes automatic keyword 
selection for topics of interests. 
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